ABSTRACT

We present a multi-resolution approach to update and refine coarse
3D models of urban environments from a sequence of intensity images
using surface parallax. A coarse and potentially incomplete depth map
of the scene, obtained from a Digital Elevation Map (DEM), is used as
a reference surface, which is refined and updated using this approach.
We first estimate the camera motion using the reference depth map.
Using the estimated camera motion, at each level in the
multi-resolution framework, the motion of 3D points on the reference
surface is compensated, and the residual flow field, which is an
epipolar field, is estimated and used to refine the depth map at that
level. At a coarse resolution, the difference between the reference
depth and the true depth will be small, leading to a small parallax
field. The refined depth map from the coarser level is then
propagated to the finer level, where it is used as the reference
depth map. Thus, significant deviations of an available model from
the true model can be handled using this approach.

1  INTRODUCTION 

There has been considerable interest recently in using autonomous
mobile robots in surveillance. The ability to send mobile,
sensor-equipped robots into environments that are potentially
hazardous to humans is of vital importance in a number of scenarios
(e.g. nuclear/biological/chemical contamination). There is a need for
robust, real-time algorithms that exploit data collected by sensors
mounted on the robots in order to improve the operator's awareness of
the scene. The operator's control station often has access to some
meta-data, e.g. elevation data of the environment in which the robots
are operating. In such a situation, it would be very useful to be
able to integrate video from the robots with elevation data to
provide the operator with a more accurate picture of the environment.
Elevation data is often available in the form of a Digital Elevation
Map (DEM), which gives the elevation of terrain over a geographical
area. Thus the available DEM can be used to obtain the reference
surface (depth map) of the scene. In general, these depth maps are
coarse and may contain only partial information about the area due to
structural changes (e.g. construction or demolition of buildings).
This coarse reference surface can be updated and refined using
information from a sequence of 2D images of the scene. The enhanced
scene can give a remote operator a better understanding of the scene
in which the robot is operating. In addition, changes in urban
environments, such as the addition of new buildings, demolition of
old buildings or other structural changes, can be incorporated in the
DEM without requiring additional dedicated DEM data collection.

2  THEORY 

For any two views of a scene under perspective projection, if the
motion of the 3D points on a surface is compensated, the resulting
parallax field is an epipolar field. Referring to Figure 1, let $C_1$
and $C_2$ represent the camera centers for the two views and $S$ be
the reference surface which is aligned. Let $Q$ be the 3D point on
the reference surface, $P$ be the true location of the 3D point, and
let $q$ and $p$ be the projections of these points in the reference
image $C_1$, respectively. The residual parallax can be shown to be
equal to (Kumar, 1994; Agrawal, 2004)

$$\delta u = q - p = \frac{T_z (Q_z - P_z)}{Q_z (P_z - T_z)} (p - e) \qquad (1)$$

where $e$ denotes the epipole and $T_z$ denotes the translation in
the $Z$ direction. If $T_z = 0$,

$$\delta u = q - p = -\frac{f (Q_z - P_z)}{Q_z P_z} \, t \qquad (2)$$

where $f$ is the focal length and $t = [T_x, T_y]^T$ denotes the 2x1
translation vector in x,y space. Without loss of generality, for the
rest of the paper we assume $T_z > 0$.
Figure 1: Parallax due to surface S


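Since (1) has the unknown correspondence $p$ on the right hand side,
it is solved for parallax in terms of $q$ as

$$\delta u = \frac{T_z (Q_z - P_z)}{P_z (Q_z - T_z)} (q - e) \qquad (3)$$

Then $\beta = \left\| \frac{T_z (Q_z - P_z)}{P_z (Q_z - T_z)} (q - e) \right\|$
denotes the parallax magnitude and $v = \frac{q - e}{\| q - e \|}$
denotes the parallax direction. The true depth $P_z$ can be estimated
using the parallax magnitude as

$$P_z = \frac{T_z Q_z}{(Q_z - T_z) \frac{\beta}{\| q - e \|} + T_z} \qquad (4)$$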
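As a quick numerical check of how (3) and (4) fit together, the following minimal Python sketch computes the parallax of a point from assumed depths and then recovers the true depth from that parallax. The function names and sample values are illustrative only and do not come from the paper. One assumption is made explicit in the code: the parallax magnitude is treated as signed (the projection of the parallax onto the direction $q - e$), so that points both in front of and behind the reference surface are handled.

```python
import math


def parallax_vector(q, e, q_z, p_z, t_z):
    """Residual parallax of equation (3): delta_u = gain * (q - e)."""
    gain = t_z * (q_z - p_z) / (p_z * (q_z - t_z))
    return (gain * (q[0] - e[0]), gain * (q[1] - e[1]))


def depth_from_parallax(q, e, q_z, t_z, delta_u):
    """Recover the true depth P_z from a measured parallax, equation (4).

    Assumption (not from the paper): beta is taken as a signed value,
    the projection of delta_u onto the unit direction (q - e)/|q - e|.
    """
    dx, dy = q[0] - e[0], q[1] - e[1]
    dist = math.hypot(dx, dy)  # |q - e|; zero at the epipole itself
    beta = (delta_u[0] * dx + delta_u[1] * dy) / dist
    return t_z * q_z / ((q_z - t_z) * beta / dist + t_z)


# Hypothetical values: reference depth 50, true depth 40, T_z = 5.
q, e = (120.0, 80.0), (20.0, 10.0)
du = parallax_vector(q, e, q_z=50.0, p_z=40.0, t_z=5.0)
recovered = depth_from_parallax(q, e, q_z=50.0, t_z=5.0, delta_u=du)
```

With these values the recovered depth equals the true depth used to generate the parallax, which is the round trip that (3) and (4) describe.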


3  APPROACH 

Our approach uses a hierarchical framework to align a non-planar
surface (hereafter referred to as the reference surface) in images
and estimate the deviations from the reference surface by calculating
the residual parallax field. The algorithm uses two frames from the
image sequence, one of them being the reference frame for which the
depth map is refined. For the rest of the paper, we refer to the
reference image as the key image and the second image as the offset
image.

3.1  Estimating  Camera  Motion 

We begin by estimating the camera motion, assuming that the camera
calibration is known. We identify a small planar region in the 3D
scene (orientation and distance in the camera coordinate system)
using the reference depth map and its corresponding region in the key
image. Since the optical flow of a planar surface is parametric
(quadratic in image pixels), we fit a parametric optical flow
(Bergen, 1992) to the region and obtain the parameters for that
region. For a planar surface, the relationship among the optical flow
parameters, the orientation and distance of the plane from the
origin, and the motion parameters is well known (Trucco, 1998). Due
to a coarse initial depth map, the motion parameters estimated using
these equations may not be very accurate. However, they can be used
as an initial estimate for refining the motion parameters as
explained below.

Consider the equations relating the image motion of a rigid body with
depth and camera motion (Horn, 1986)

$$u(x,y) = (-x_f + x) \frac{1}{Z'} + \frac{1}{f} x y \, \Omega_x - \left( f + \frac{1}{f} x^2 \right) \Omega_y + y \, \Omega_z \qquad (5)$$

$$v(x,y) = (-y_f + y) \frac{1}{Z'} + \left( f + \frac{1}{f} y^2 \right) \Omega_x - \frac{1}{f} x y \, \Omega_y - x \, \Omega_z \qquad (6)$$

where $(x_f, y_f)$ denotes the FOE (focus of expansion) in image
coordinates, $(\Omega_x, \Omega_y, \Omega_z)^T$ denotes the camera
rotation velocities, $Z'$ is the scaled depth, $Z' = Z_{ref} / T_z$,
and $(u, v)^T$ denotes the 2-D velocities according to the reference
depth $Z_{ref}$. We use the initial motion estimate
$x_f, y_f, \Omega_x, \Omega_y, \Omega_z$ to estimate $Z'$ and use the
estimated $Z'$ to refine the motion estimates. This is iterated until
the motion estimates are stable or a specified number of iterations
is reached. Finally, we obtain an estimate of $T_z$ as
$T_z = \langle Z_{ref} / Z' \rangle$, where $\langle \cdot \rangle$
denotes the averaging operator over the planar surface.

Note  that  the  FOE  values  are  not  affected  by  the 
estimated  Tz  because  we  are  refining  over  the  FOE 
values  first  and  then  estimating  Tz  using  the  refined 
scaled  depths.  In  fact,  since  depth  and  Tz  are  coupled, 
we  can  use  the  scaled  depth  throughout  our  algorithm 
without  explicitly  computing  Tz. 
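The alternation between motion and scaled depth can be illustrated with a small Python sketch of the motion field equations (5) and (6). It is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-pixel least-squares solve for $Z'$ is one plausible way to "estimate $Z'$" from a given motion estimate, and the final averaging step assumes the scaled depth is defined as $Z' = Z_{ref} / T_z$.

```python
def motion_field(x, y, foe, omega, f, z_scaled):
    """Image velocity (u, v) at pixel (x, y) from equations (5) and (6)."""
    x_f, y_f = foe
    wx, wy, wz = omega
    u = (x - x_f) / z_scaled + x * y * wx / f - (f + x * x / f) * wy + y * wz
    v = (y - y_f) / z_scaled + (f + y * y / f) * wx - x * y * wy / f - x * wz
    return u, v


def estimate_scaled_depth(x, y, flow, foe, omega, f):
    """Least-squares Z' at one pixel, given observed flow and a motion estimate.

    Hypothetical helper: subtract the rotational flow, then fit the
    remainder to (1/Z') * (x - x_f, y - y_f). Undefined at the FOE and
    where the translational flow vanishes.
    """
    # An infinite scaled depth switches off the translational term.
    u_rot, v_rot = motion_field(x, y, foe, omega, f, float("inf"))
    ru, rv = flow[0] - u_rot, flow[1] - v_rot
    dx, dy = x - foe[0], y - foe[1]
    inv_z = (ru * dx + rv * dy) / (dx * dx + dy * dy)
    return 1.0 / inv_z


def estimate_tz(z_ref_values, z_scaled_values):
    """Average of Z_ref / Z' over the pixels of the planar region."""
    ratios = [zr / zs for zr, zs in zip(z_ref_values, z_scaled_values)]
    return sum(ratios) / len(ratios)


# Hypothetical motion and pixel values for a round-trip check.
foe, omega, f = (10.0, -5.0), (0.001, -0.002, 0.0005), 500.0
flow = motion_field(100.0, 60.0, foe, omega, f, z_scaled=20.0)
z_back = estimate_scaled_depth(100.0, 60.0, flow, foe, omega, f)
```

Here `z_back` reproduces the scaled depth used to synthesize the flow, which is the consistency that the iteration relies on when the motion estimate is accurate.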


3.2  Hierarchical  Framework 

Let the superscript $l$ denote the resolution level, i.e. $x^l$
denotes a variable at level $l$ with $l = 1 \ldots L$, where $L$ is
the coarsest level. Dividing each side of (3) by $2^l$, we get

$$\frac{\delta u}{2^l} = \frac{q}{2^l} - \frac{p}{2^l} = \frac{\frac{T_z}{2^l} \left( \frac{Q_z}{2^l} - \frac{P_z}{2^l} \right)}{\frac{P_z}{2^l} \left( \frac{Q_z}{2^l} - \frac{T_z}{2^l} \right)} \left( \frac{q}{2^l} - \frac{e}{2^l} \right) = \frac{T_z^l (Q_z^l - P_z^l)}{P_z^l (Q_z^l - T_z^l)} (q^l - e^l) \qquad (7)$$

where $Q_z^l = Q_z / 2^l$ and $P_z^l = P_z / 2^l$ denote the assumed
and true depths at level $l$, $T_z^l = T_z / 2^l$, and
$q^l = q / 2^l$, $e^l = e / 2^l$ are the image pixel coordinates at
level $l$. The above equation shows the relationship between the
parallax and the depths at level $l$. Thus we can see that at coarser
levels, the parallax field is small. The hierarchical estimation
algorithm proceeds as follows:


1.  Estimate  the  camera  motion.  Construct  pyramids 
for  the  key  and  offset  images  and  for  the  reference 
depth  map. 

2.  Initialize  l  =  L.  Use  the  reference  depth  map  and 
the  motion  estimates  at  coarsest  level  to  estimate 
the  parallax  field  (as  described  in  the  Appendix). 
Refine  the  depths  using  the  parallax. 

3. Propagate the depths to level $l - 1$. $l \to l - 1$.

4. Warp the offset image according to the propagated depth at the
current level and use it to estimate the parallax field. Refine the
depths using the estimated parallax.

5. Iterate steps 3 and 4 until $l = 1$.
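The scaling relation (7) that motivates these steps can be checked numerically. The Python sketch below evaluates the parallax of a single point at each pyramid level, visiting the levels in the coarse-to-fine order used by the algorithm. The function names and sample values are hypothetical, and the sketch covers only the scaling behaviour, not the parallax estimation or depth propagation themselves.

```python
import math


def parallax(q, e, q_z, p_z, t_z):
    """Parallax vector of equation (3) for one point."""
    gain = t_z * (q_z - p_z) / (p_z * (q_z - t_z))
    return (gain * (q[0] - e[0]), gain * (q[1] - e[1]))


def parallax_at_level(q, e, q_z, p_z, t_z, level):
    """Equation (7): every quantity is divided by 2**level."""
    s = 2.0 ** level
    return parallax((q[0] / s, q[1] / s), (e[0] / s, e[1] / s),
                    q_z / s, p_z / s, t_z / s)


def magnitudes_coarse_to_fine(q, e, q_z, p_z, t_z, num_levels):
    """Return (level, parallax magnitude) pairs for levels L, L-1, ..., 1."""
    result = []
    for level in range(num_levels, 0, -1):
        du = parallax_at_level(q, e, q_z, p_z, t_z, level)
        result.append((level, math.hypot(du[0], du[1])))
    return result


# Hypothetical point: reference depth 100, true depth 60, T_z = 5.
mags = magnitudes_coarse_to_fine((400.0, 300.0), (320.0, 240.0),
                                 q_z=100.0, p_z=60.0, t_z=5.0, num_levels=3)
```

Because the depth ratio in (7) is unchanged by the common factor $2^l$, the magnitude doubles at each step toward the finer level, so a parallax that is too large to estimate directly at fine resolution becomes small at the coarsest level.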


4  EXPERIMENTS 

We  present  results  on  both  semi-synthetic  and  real 
world  3D  models.  In  all  experiments,  images  contain 
640  x  480  pixels. 

4.1  Semi-synthetic  Models 

For semi-synthetic models (with real textures), we rendered a 3D
model of a city with buildings and objects in OpenGL. We simulated a
sequence of images by moving a virtual camera in the scene. The depth
maps were obtained from the OpenGL Z buffer. The depth maps are color
coded (with brighter regions nearer to the camera). Figures 2(a) and
2(b) show the key and offset frames from a synthetic image sequence,
respectively. Figure 2(c) shows the true depth map for the key frame
and Figure 2(d) shows the reference depth map which was used as a
surface for alignment. The background is kept at a depth of 1000
units. A portion of the ground plane was used for camera motion
estimation as explained in Section 3.1. The true and estimated camera
motion parameters are shown in Table 1 (with rotation angles in
degrees).

Table 1: True and estimated motion parameters for semi-synthetic
example

             Tx      Ty      Tz      ωx      ωy      ωz
True        -3.76    0.60    5.26    0.02   -1.29    1.47
Estimated   -3.68    0.57    4.93    0.01   -1.34    1.45

Table 2: Percentage depth error between the true depth map and the
reference and estimated depth maps using different numbers of levels L

Depth Map           Percentage Depth Error
Reference           35.59
Estimated: L = 1    23.94
Estimated: L = 2    16.21
Estimated: L = 3     3.74

Figures 2(e), 2(f) and 2(g) show the estimated depth maps using
different numbers of levels L in the multi-resolution framework.
Notice that for L = 1, the depths of the portion of the building (in
the center of the depth map image) which overlaps with the building
at the back are estimated correctly, whereas for the portion which
overlaps with the background, the depths are not estimated properly,
because for pixels in that region the parallax magnitude due to the
high depth difference (from the background) is much higher. The
maximum parallax magnitudes at levels 1, 2 and 3 are 10.37, 5.17 and
2.57 pixels respectively. The estimated depth map using L = 3 is
better than those obtained using L = 1 and 2.

We define the relative percentage depth error between the true depth
map $Z_{true}$ and some other depth map $Z$ as
$100 \times \frac{1}{N} \sum_i \frac{(Z_{true,i} - Z_i)^2}{Z_{true,i}^2}$,
where $N$ denotes the total number of pixels in the image. Table 2 gives the
percentage depth error between the true depth map and the initial
reference and estimated depth maps using different numbers of levels
L. Thus, the hierarchical approach was able to estimate the parallax
for regions with high parallax magnitude. The results obtained using
the hierarchical approach are much better both qualitatively and
quantitatively.

4.2  Real  World  Models 

A DEM model of downtown Baltimore (inner harbor area) was rendered in
OpenGL and the reference depth map was obtained using the Z buffer as
shown in Figure 3(c). The depth map is color coded (with brighter
regions closer to the camera). Video images were captured using a
Sony camcorder placed on a cart (not mounted) moving across a street.
Figures 3(a) and 3(b) show the key and offset frames from the video
sequence respectively. Notice that the reference depth map is quite
coarse. In order to show the effectiveness of the hierarchical
framework, a portion of the reference depth map was modified to a
very small depth value (shown in Figure 3(d)) so that the difference
in depths for that portion of the image is large, leading to large
parallax values. A portion of the ground plane was used for camera
motion estimation. Figures 3(e) and 3(f) show the estimated depths
using the hierarchical algorithm for L = 1 and 3 respectively. Notice
that for L = 1, the parallax field for the patch where the depths
were modified to a low value is not estimated properly. As a result,
the obtained depths are not correct (they are in fact much closer).
For L = 3, we can see that a better estimate of depths is obtained.
For example, the depth of the pole in the foreground is more
accurately recovered.

Figure 2: Semi-synthetic example (a) Key frame (b) Offset frame (c)
True depth map for key frame (d) Reference depth map used for
alignment (e,f,g) Estimated depth maps using different levels L


5  CONCLUSIONS

A hierarchical framework for refining and updating a 3D model given a
coarse depth map has been presented. The approach can be viewed as a
fusion of available depth information (metadata) with the information
from intensity images obtained from mobile robots. Results on both
semi-synthetic 3D models and real models were presented. The
estimated depth map is quite accurate for the semi-synthetic 3D model
and appears plausible for the real model. The enhanced scene can
provide a much better understanding of the scene in which the robot
is operating.
APPENDIX

Let $I(x,y,t)$ and $I(x,y,t-1)$ denote the key and offset frames
respectively. Let $(u(x,y), v(x,y))$ denote the true optical flow of
pixel $(x,y)$ in the key image. The optical flow can be decomposed as

$$u(x,y) = u_{Z_{ref}}(x,y) + u_p(x,y)$$
$$v(x,y) = v_{Z_{ref}}(x,y) + v_p(x,y) \qquad (8)$$

where $(u_{Z_{ref}}, v_{Z_{ref}})$ denotes the flow due to the
reference surface $Z_{ref}$ at level $l$ and $(u_p, v_p)$ denotes the
parallax due to $Z_{ref}$. Assuming brightness constancy, we have

$$I(x,y,t) = I(x - u_{Z_{ref}} - u_p, \; y - v_{Z_{ref}} - v_p, \; t-1) \qquad (9)$$

Assuming a small parallax field, we make the approximation

$$I(x + u_p, \; y + v_p, \; t) = I(x - u_{Z_{ref}}, \; y - v_{Z_{ref}}, \; t-1) \qquad (10)$$

Expanding the left hand side of the above equation in a Taylor series
around $(x,y)$ and neglecting higher order terms, we have

$$I_x u_p + I_y v_p + \Delta I = 0 \qquad (11)$$

where $I_x$ and $I_y$ denote the spatial image gradients and
$\Delta I$ represents the difference between the key image and the
offset image warped according to the reference $Z$. $u_{Z_{ref}}$ and
$v_{Z_{ref}}$ are calculated from the reference $Z$ and motion
estimates using (??) and the offset image is warped towards the key
image using bilinear interpolation. Since we know the camera motion
and hence the FOE $(x_f, y_f)$, we can write the parallax field as

$$u_p(x,y) = \beta(x,y) \, d_u(x,y)$$
$$v_p(x,y) = \beta(x,y) \, d_v(x,y) \qquad (12)$$

where $\left( d_u(x,y) = \frac{x - x_f}{\sqrt{(x - x_f)^2 + (y - y_f)^2}}, \; d_v(x,y) = \frac{y - y_f}{\sqrt{(x - x_f)^2 + (y - y_f)^2}} \right)$
denotes the parallax direction and $\beta(x,y)$ denotes the parallax
magnitude for pixel $(x,y)$. Equation (11) then becomes

$$\beta(x,y) I_p(x,y) + \Delta I(x,y) = 0 \qquad (13)$$

where $I_p = I_x d_u + I_y d_v$ denotes the projection of the
intensity gradient in the parallax direction. This is a linear system
for each pixel $(x,y)$. Assuming that the parallax magnitude is
constant over a neighborhood $N \times N$, for each pixel $(x,y)$ we
minimize the following error function

$$J(x,y) = \min_p \sum_{(x,y) \in N \times N} \langle p^T g(x,y) g(x,y)^T p \rangle$$

where $p = [\mu, \nu]^T$, $g(x,y) = [I_p(x,y), \Delta I(x,y)]^T$ and
$\langle \cdot \rangle$ denotes the smoothing operator defined as

$$\langle f(x,y) \rangle = \int \!\! \int_{-\infty}^{\infty} w(x' - x, y' - y) f(x', y') \, dx' \, dy' \qquad (14)$$

where $w$ is a smoothing function. Then the parallax magnitude will
be given by $\beta(x,y) = \mu / \nu$. To avoid the trivial solution
$p = 0$, the constraint $p^T p = 1$ is imposed. Using Lagrange
multipliers, the error function can be written as

$$J(x,y) = \min_p \sum_{(x,y) \in N \times N} \langle p^T g(x,y) g(x,y)^T p \rangle + \lambda (1 - p^T p) \qquad (15)$$

Differentiating with respect to $p$, we get $G p = \lambda p$ where

$$G = \begin{bmatrix} \langle I_p(x,y) I_p(x,y) \rangle & \langle I_p(x,y) \Delta I(x,y) \rangle \\ \langle I_p(x,y) \Delta I(x,y) \rangle & \langle \Delta I(x,y) \Delta I(x,y) \rangle \end{bmatrix}$$

The eigenvector corresponding to the smaller eigenvalue of $G$ will
be the solution for $p$, from which the parallax magnitude can be
estimated.

Figure 3: Real example (a) Key frame (b) Offset frame (c) Reference
depth map used for alignment (d) Modified reference depth map (e,f)
Estimated depth maps
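The eigenvector solution described in this Appendix can be sketched in a few lines of Python for a single neighborhood. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, the smoothing function is assumed to be a uniform window (so the smoothed products become plain sums), and the parallax magnitude is read off as the ratio of the first to the second component of $p$, which follows from requiring $g^T p = 0$ to agree with (13).

```python
import math


def parallax_magnitude(ip_values, di_values):
    """Estimate beta for one neighborhood from samples of I_p and Delta I.

    Builds the symmetric 2x2 matrix G = [[a, b], [b, c]], takes the
    eigenvector of its smaller eigenvalue, and returns the ratio of the
    eigenvector's first component to its second. Returns None when the
    ratio is undefined (for example, no gradient along the parallax
    direction).
    """
    # Uniform smoothing window assumed: smoothed products reduce to sums.
    a = sum(ip * ip for ip in ip_values)
    b = sum(ip * di for ip, di in zip(ip_values, di_values))
    c = sum(di * di for di in di_values)

    # Smaller eigenvalue of a symmetric 2x2 matrix in closed form.
    lam = 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b * b)

    # First row of (G - lam*I) p = 0 gives p proportional to (b, lam - a).
    denom = lam - a
    if abs(denom) < 1e-12:
        return None
    return b / denom


# Synthetic neighborhood that satisfies (13) exactly with beta = 1.5.
ip = [1.0, -2.0, 0.5, 3.0]
di = [-1.5 * value for value in ip]
beta = parallax_magnitude(ip, di)
```

For noise-free samples the smaller eigenvalue is zero and the returned value matches the magnitude used to generate the data; with noisy samples the same computation gives a total-least-squares style fit over the neighborhood.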